You may add as many code and markdown cells as you see fit to answer the questions.
You will demonstrate your ability to merge, group, summarize, visualize, and find patterns in data. This exam uses data associated with a manufacturing example. An overview of the goals, considerations, CSV files, and variables within the data is provided in a presentation on Canvas. Please read through those slides before starting the exam.
The data are provided in 5 separate CSV files. The CSV files are available on Canvas. You MUST download the files and save them to the same working directory as this notebook.
The specific instructions in this notebook tell you when you must JOIN the data together. Please read the problems carefully.
The overall objective of this exam is to JOIN data from multiple files in order to explore and find interesting patterns between the machine operating conditions and supplier information. You will report your findings within this notebook by displaying Pandas DataFrames and statistical visualizations via Seaborn and matplotlib when necessary.
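As context for the tasks below, the two pandas combining operations this exam relies on differ in kind: `pd.concat` stacks DataFrames row-wise, while `pd.merge` joins them on key columns. A minimal sketch on toy data (hypothetical values, not the exam files):

```python
import pandas as pd

# toy frames standing in for two machine data sets and a supplier lookup
m_a = pd.DataFrame({'Batch': [1, 2], 'x1': [50.0, 51.0]})
m_b = pd.DataFrame({'Batch': [1, 2], 'x1': [49.5, 52.0]})
suppliers = pd.DataFrame({'Batch': [1, 2], 'Supplier': ['A', 'B']})

# concat: stack rows from both machines into one frame
stacked = pd.concat([m_a, m_b], ignore_index=True)
print(len(stacked))  # 4

# merge: attach supplier information by the Batch key
joined = pd.merge(stacked, suppliers, on='Batch')
print(joined.shape)  # (4, 3)
```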
You are permitted to use the following modules on this exam.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
You may also use the following functions from scikit-learn on this exam.
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
You may also use the following sub module from SCIPY.
from scipy.cluster import hierarchy
sns.set_palette("colorblind")
You are NOT permitted to use any other modules or functions. However, you ARE permitted to create your own user defined functions if you would like.
The file names for the 3 machine data sets are provided as strings in the cell below. You are required to read in the CSV files and assign the data to the m01_df, m02_df, and m03_df objects. The data from machine 1 will therefore be associated with m01_df, machine 2 is associated with m02_df, and machine 3 is associated with m03_df.
In this problem you must explore each of the three machine data sets.
You must perform the following ESSENTIAL activities:
You must visually explore the MARGINAL behavior of the variables in the data. You must use visualizations appropriate for the DATA TYPE of the columns.
You must visually explore RELATIONSHIPS between variables in the data. You must use visualizations appropriate for the DATA TYPES. You must make sure that your visualizations can answer the following questions:
- How many unique values for Batch are associated with each MACHINE (data set)?
- How many cell phone cases are associated with each Batch value for each MACHINE (data set)?
- Do the summary statistics of the OPERATING VARIABLES x1 through x4 vary across the three MACHINES?
- Do the summary statistics of the OPERATING VARIABLES x1 through x4 vary across the Batch values?
- Do the relationships between the OPERATING VARIABLES x1 through x4 vary across the three MACHINES?
- Do the relationships between the OPERATING VARIABLES x1 through x4 vary across the Batch values?

At the conclusion of this problem, you MUST CONCATENATE the 3 MACHINE data sets into a single DataFrame. The single DataFrame must be named machine_df. Before concatenating, you MUST add a column machine_id to each DataFrame with the correct index value for that machine (1, 2, or 3). The concatenated DataFrame variable name is provided as a reminder to you below.
You may add as many markdown and code cells as you see fit to answer this question. Include markdown cells stating what you see in the figures and why you selected to use them.
# Define the file names for the 3 machine-level CSV files
file_m01 = 'midterm_machine_01.csv'
file_m02 = 'midterm_machine_02.csv'
file_m03 = 'midterm_machine_03.csv'
# read in the CSV files and name them accordingly
m01_df = pd.read_csv(file_m01)
m01_df.name = "m01_df"
m02_df = pd.read_csv(file_m02)
m02_df.name = "m02_df"
m03_df = pd.read_csv(file_m03)
m03_df.name = "m03_df"
# shape of the dataframes
for mxx_df in [m01_df, m02_df, m03_df]:
    print(f'Shape of {mxx_df.name} is {mxx_df.shape[0]} rows and {mxx_df.shape[1]} columns')
    print('-'*43)
Shape of m01_df is 5152 rows and 7 columns
-------------------------------------------
Shape of m02_df is 5119 rows and 7 columns
-------------------------------------------
Shape of m03_df is 4458 rows and 7 columns
-------------------------------------------
for mxx_df in [m01_df, m02_df, m03_df]:
    print(f'Names and Data types for each column in {mxx_df.name}:')
    mxx_df.info()
    print('-'*30)
Names and Data types for each column in m01_df:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5152 entries, 0 to 5151
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   ID      5152 non-null   object
 1   Batch   5152 non-null   int64
 2   s_id    5152 non-null   int64
 3   x1      5152 non-null   float64
 4   x2      5152 non-null   float64
 5   x3      5152 non-null   float64
 6   x4      5152 non-null   float64
dtypes: float64(4), int64(2), object(1)
memory usage: 281.9+ KB
------------------------------
Names and Data types for each column in m02_df:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5119 entries, 0 to 5118
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   ID      5119 non-null   object
 1   Batch   5119 non-null   int64
 2   s_id    5119 non-null   int64
 3   x1      5119 non-null   float64
 4   x2      5119 non-null   float64
 5   x3      5119 non-null   float64
 6   x4      5119 non-null   float64
dtypes: float64(4), int64(2), object(1)
memory usage: 280.1+ KB
------------------------------
Names and Data types for each column in m03_df:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4458 entries, 0 to 4457
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   ID      4458 non-null   object
 1   Batch   4458 non-null   int64
 2   s_id    4458 non-null   int64
 3   x1      4458 non-null   float64
 4   x2      4458 non-null   float64
 5   x3      4458 non-null   float64
 6   x4      4458 non-null   float64
dtypes: float64(4), int64(2), object(1)
memory usage: 243.9+ KB
------------------------------
for mxx_df in [m01_df, m02_df, m03_df]:
    print(f'Missing Values in {mxx_df.name}:\n{mxx_df.isna().sum()}')
    print('-'*30)
Missing Values in m01_df:
ID       0
Batch    0
s_id     0
x1       0
x2       0
x3       0
x4       0
dtype: int64
------------------------------
Missing Values in m02_df:
ID       0
Batch    0
s_id     0
x1       0
x2       0
x3       0
x4       0
dtype: int64
------------------------------
Missing Values in m03_df:
ID       0
Batch    0
s_id     0
x1       0
x2       0
x3       0
x4       0
dtype: int64
------------------------------
First, convert the identifier variables ID, Batch, and s_id to a categorical (object) type so that pandas treats them as non-numeric:
for mxx_df in [m01_df, m02_df, m03_df]:
    mxx_df['Batch'] = mxx_df['Batch'].astype('object')
    mxx_df['ID'] = mxx_df['ID'].astype('object')
    mxx_df['s_id'] = mxx_df['s_id'].astype('object')
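As an aside (not required by the exam), pandas also offers a dedicated `category` dtype, which can be more memory-efficient than `object` for low-cardinality columns such as `Batch`. A minimal sketch on a toy frame (hypothetical values):

```python
import pandas as pd

# toy frame mimicking the identifier columns
toy = pd.DataFrame({'Batch': [1, 1, 2, 2], 's_id': [1, 2, 1, 2]})

# convert to the dedicated categorical dtype instead of plain object
toy['Batch'] = toy['Batch'].astype('category')

print(toy['Batch'].dtype)                  # category
print(list(toy['Batch'].cat.categories))   # [1, 2]
```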
m01_df

m01_df.describe()
| | x1 | x2 | x3 | x4 |
|---|---|---|---|---|
| count | 5152.000000 | 5152.000000 | 5152.000000 | 5152.000000 |
| mean | 51.998218 | 102.207011 | 23.947216 | 10.628436 |
| std | 3.151912 | 12.888171 | 2.269325 | 1.542291 |
| min | 44.275379 | 65.810729 | 19.095832 | 7.644190 |
| 25% | 49.335776 | 96.610288 | 22.334048 | 9.479969 |
| 50% | 52.536206 | 104.487824 | 23.855690 | 10.356002 |
| 75% | 54.662490 | 110.609442 | 25.448356 | 11.318120 |
| max | 58.647186 | 130.228828 | 29.477213 | 14.612911 |
m01_df.describe(include='object')
| | ID | Batch | s_id |
|---|---|---|---|
| count | 5152 | 5152 | 5152 |
| unique | 5152 | 50 | 149 |
| top | B001-M01-S001 | 36 | 1 |
| freq | 1 | 149 | 50 |
sns.catplot( data = m01_df, x='ID', kind='count', aspect=3 )
plt.show()
ID is unique for each cell phone case. The plot above shows that each row in the dataset represents one unique cell phone case.
h = sns.catplot(data=m01_df, x='Batch', kind='count', aspect=3)
h.fig.subplots_adjust(top=0.9)
for ax in h.axes.ravel():
    # add annotations
    for c in ax.containers:
        labels = [f'{(v.get_height()):.0f}' for v in c]
        ax.bar_label(c, labels=labels, label_type='edge')
    ax.margins(y=0.2)
plt.tight_layout()
plt.show()
The plot above shows the number of cell phone cases manufactured per Batch.
sns.catplot( data = m01_df, x='s_id', kind='count', aspect=3)
plt.show()
s_id is the sequential production index for a single cell phone case within a Batch on a machine.
First, create a LONG FORMAT representation of this machine's dataset.
df_features = m01_df.select_dtypes('number').copy()
df_objects = m01_df.select_dtypes('object').copy()
id_cols = ['rowid'] + df_objects.columns.to_list()
m01_df_lf = m01_df.reset_index().\
rename(columns={'index': 'rowid'}).\
melt(id_vars=id_cols, value_vars=df_features.columns)
m01_df_lf
| | rowid | ID | Batch | s_id | variable | value |
|---|---|---|---|---|---|---|
| 0 | 0 | B001-M01-S001 | 1 | 1 | x1 | 50.117118 |
| 1 | 1 | B001-M01-S002 | 1 | 2 | x1 | 46.887333 |
| 2 | 2 | B001-M01-S003 | 1 | 3 | x1 | 50.132744 |
| 3 | 3 | B001-M01-S004 | 1 | 4 | x1 | 48.501042 |
| 4 | 4 | B001-M01-S005 | 1 | 5 | x1 | 49.690442 |
| ... | ... | ... | ... | ... | ... | ... |
| 20603 | 5147 | B050-M01-S105 | 50 | 105 | x4 | 11.456779 |
| 20604 | 5148 | B050-M01-S106 | 50 | 106 | x4 | 11.048374 |
| 20605 | 5149 | B050-M01-S107 | 50 | 107 | x4 | 10.697194 |
| 20606 | 5150 | B050-M01-S108 | 50 | 108 | x4 | 11.052844 |
| 20607 | 5151 | B050-M01-S109 | 50 | 109 | x4 | 11.190619 |
20608 rows × 6 columns
sns.displot(data = m01_df_lf, x='value', col='variable', kind='kde',
facet_kws={'sharex': False, 'sharey': False},
common_norm=False)
plt.show()
Do the summary statistics of the OPERATING VARIABLES x1 through x4 vary across the Batch values?
sns.catplot(data = m01_df_lf, x='Batch', y='value', col='variable', col_wrap=1, aspect=3,
kind='box',
sharey=False)
plt.show()
Based on the above plot, yes, summary statistics of the OPERATING VARIABLES x1 through x4 vary across the Batch values
Do the relationships between the OPERATING VARIABLES x1 through x4 vary across the Batch values?
sns.pairplot(data=m01_df[['x1', 'x2', 'x3', 'x4', 'Batch']],
hue='Batch',
diag_kws={'common_norm': False},
palette='viridis')
plt.show()
the_groups = m01_df.Batch.unique().tolist()
corr_per_group = m01_df.loc[ :, ['Batch', 'x1', 'x2', 'x3', 'x4']].groupby(['Batch']).corr()
subgroups = [the_groups[i:i + 5] for i in range(0, len(the_groups), 5)] # Done so that the plot isn't cluttered
for sg in subgroups:
    fig, axs = plt.subplots(1, len(sg), figsize=(18, 5), sharex=True, sharey=True )
    for i in range(len(sg)):
        sns.heatmap( data = corr_per_group.loc[ sg[ i ] ],
                     vmin=-1, vmax=1, center = 0,
                     cmap='coolwarm', cbar=False,
                     annot=True, annot_kws={'size': 10},
                     ax=axs[i] )
        axs[i].set_title('Batch: %s' % sg[ i ] )
    plt.show()
Based on the above pair plot and correlation plot we can confirm that the relationships between the operating variables vary across Batch values
m02_df

m02_df.describe()
| | x1 | x2 | x3 | x4 |
|---|---|---|---|---|
| count | 5119.000000 | 5119.000000 | 5119.000000 | 5119.000000 |
| mean | 51.989377 | 101.998905 | 23.982312 | 10.605275 |
| std | 3.095129 | 12.834348 | 2.275809 | 1.523140 |
| min | 44.896759 | 64.774007 | 19.232164 | 7.934064 |
| 25% | 49.240567 | 96.325502 | 22.372944 | 9.492226 |
| 50% | 52.610949 | 104.357308 | 23.894465 | 10.329911 |
| 75% | 54.785418 | 110.672960 | 25.546656 | 11.286062 |
| max | 58.503441 | 131.009046 | 29.093229 | 14.400437 |
m02_df.describe(include='object')
| | ID | Batch | s_id |
|---|---|---|---|
| count | 5119 | 5119 | 5119 |
| unique | 5119 | 50 | 133 |
| top | B001-M02-S001 | 43 | 1 |
| freq | 1 | 133 | 50 |
sns.catplot( data = m02_df, x='ID', kind='count', aspect=3 )
plt.show()
ID is unique for each cell phone case. The plot above shows that each row in the dataset represents one unique cell phone case.
h = sns.catplot(data=m02_df, x='Batch', kind='count', aspect=3)
h.fig.subplots_adjust(top=0.9)
for ax in h.axes.ravel():
    # add annotations
    for c in ax.containers:
        labels = [f'{(v.get_height()):.0f}' for v in c]
        ax.bar_label(c, labels=labels, label_type='edge')
    ax.margins(y=0.2)
plt.tight_layout()
plt.show()
The plot above shows the number of cell phone cases manufactured per Batch.
sns.catplot( data = m02_df, x='s_id', kind='count', aspect=3)
plt.show()
s_id is the sequential production index for a single cell phone case within a Batch on a machine.
First, create a LONG FORMAT representation of this machine's dataset.
df_features = m02_df.select_dtypes('number').copy()
df_objects = m02_df.select_dtypes('object').copy()
id_cols = ['rowid'] + df_objects.columns.to_list()
m02_df_lf = m02_df.reset_index().\
rename(columns={'index': 'rowid'}).\
melt(id_vars=id_cols, value_vars=df_features.columns)
m02_df_lf
| | rowid | ID | Batch | s_id | variable | value |
|---|---|---|---|---|---|---|
| 0 | 0 | B001-M02-S001 | 1 | 1 | x1 | 50.213596 |
| 1 | 1 | B001-M02-S002 | 1 | 2 | x1 | 49.811232 |
| 2 | 2 | B001-M02-S003 | 1 | 3 | x1 | 48.968142 |
| 3 | 3 | B001-M02-S004 | 1 | 4 | x1 | 50.477133 |
| 4 | 4 | B001-M02-S005 | 1 | 5 | x1 | 50.188501 |
| ... | ... | ... | ... | ... | ... | ... |
| 20471 | 5114 | B050-M02-S092 | 50 | 92 | x4 | 10.963370 |
| 20472 | 5115 | B050-M02-S093 | 50 | 93 | x4 | 11.156604 |
| 20473 | 5116 | B050-M02-S094 | 50 | 94 | x4 | 11.303222 |
| 20474 | 5117 | B050-M02-S095 | 50 | 95 | x4 | 11.084512 |
| 20475 | 5118 | B050-M02-S096 | 50 | 96 | x4 | 10.886910 |
20476 rows × 6 columns
sns.displot(data = m02_df_lf, x='value', col='variable', kind='kde',
facet_kws={'sharex': False, 'sharey': False},
common_norm=False)
plt.show()
Do the summary statistics of the OPERATING VARIABLES x1 through x4 vary across the Batch values?
sns.catplot(data = m02_df_lf, x='Batch', y='value', col='variable', col_wrap=1, aspect=3,
kind='box',
sharey=False)
plt.show()
Based on the above plot, yes, summary statistics of the OPERATING VARIABLES x1 through x4 vary across the Batch values
Do the relationships between the OPERATING VARIABLES x1 through x4 vary across the Batch values?
sns.pairplot(data=m02_df[['x1', 'x2', 'x3', 'x4', 'Batch']],
hue='Batch',
diag_kws={'common_norm': False},
palette='viridis')
plt.show()
the_groups = m02_df.Batch.unique().tolist()
corr_per_group = m02_df.loc[ :, ['Batch', 'x1', 'x2', 'x3', 'x4']].groupby(['Batch']).corr()
subgroups = [the_groups[i:i + 5] for i in range(0, len(the_groups), 5)] # Done so that the plot isn't cluttered
for sg in subgroups:
    fig, axs = plt.subplots(1, len(sg), figsize=(18, 5), sharex=True, sharey=True )
    for i in range(len(sg)):
        sns.heatmap( data = corr_per_group.loc[ sg[ i ] ],
                     vmin=-1, vmax=1, center = 0,
                     cmap='coolwarm', cbar=False,
                     annot=True, annot_kws={'size': 10},
                     ax=axs[i] )
        axs[i].set_title('Batch: %s' % sg[ i ] )
    plt.show()
Based on the above pair plot and correlation plot we can confirm that the relationships between the operating variables vary across Batch values
m03_df

m03_df.describe()
| | x1 | x2 | x3 | x4 |
|---|---|---|---|---|
| count | 4458.000000 | 4458.000000 | 4458.000000 | 4458.000000 |
| mean | 51.857672 | 101.650057 | 23.927679 | 10.602371 |
| std | 3.022646 | 12.243012 | 2.035196 | 1.554708 |
| min | 45.690397 | 70.192150 | 19.865338 | 8.155668 |
| 25% | 49.228020 | 95.381831 | 22.457249 | 9.475467 |
| 50% | 52.598712 | 104.712727 | 23.900193 | 10.351781 |
| 75% | 54.810528 | 110.050211 | 25.142293 | 11.301126 |
| max | 56.981029 | 126.332105 | 28.972747 | 14.182623 |
m03_df.describe(include='object')
| | ID | Batch | s_id |
|---|---|---|---|
| count | 4458 | 4458 | 4458 |
| unique | 4458 | 41 | 141 |
| top | B001-M03-S001 | 5 | 1 |
| freq | 1 | 141 | 41 |
sns.catplot( data = m03_df, x='ID', kind='count', aspect=3 )
plt.show()
ID is unique for each cell phone case. The plot above shows that each row in the dataset represents one unique cell phone case.
h = sns.catplot(data=m03_df, x='Batch', kind='count', aspect=3)
h.fig.subplots_adjust(top=0.9)
for ax in h.axes.ravel():
    # add annotations
    for c in ax.containers:
        labels = [f'{(v.get_height()):.0f}' for v in c]
        ax.bar_label(c, labels=labels, label_type='edge')
    ax.margins(y=0.2)
plt.tight_layout()
plt.show()
The plot above shows the number of cell phone cases manufactured per Batch.
sns.catplot( data = m03_df, x='s_id', kind='count', aspect=3)
plt.show()
s_id is the sequential production index for a single cell phone case within a Batch on a machine.
First, create a LONG FORMAT representation of this machine's dataset.
df_features = m03_df.select_dtypes('number').copy()
df_objects = m03_df.select_dtypes('object').copy()
id_cols = ['rowid'] + df_objects.columns.to_list()
m03_df_lf = m03_df.reset_index().\
rename(columns={'index': 'rowid'}).\
melt(id_vars=id_cols, value_vars=df_features.columns)
m03_df_lf
| | rowid | ID | Batch | s_id | variable | value |
|---|---|---|---|---|---|---|
| 0 | 0 | B001-M03-S001 | 1 | 1 | x1 | 49.317142 |
| 1 | 1 | B001-M03-S002 | 1 | 2 | x1 | 49.983523 |
| 2 | 2 | B001-M03-S003 | 1 | 3 | x1 | 48.762702 |
| 3 | 3 | B001-M03-S004 | 1 | 4 | x1 | 50.009580 |
| 4 | 4 | B001-M03-S005 | 1 | 5 | x1 | 48.892358 |
| ... | ... | ... | ... | ... | ... | ... |
| 17827 | 4453 | B049-M03-S100 | 49 | 100 | x4 | 13.753264 |
| 17828 | 4454 | B049-M03-S101 | 49 | 101 | x4 | 13.636539 |
| 17829 | 4455 | B049-M03-S102 | 49 | 102 | x4 | 13.678206 |
| 17830 | 4456 | B049-M03-S103 | 49 | 103 | x4 | 13.880084 |
| 17831 | 4457 | B049-M03-S104 | 49 | 104 | x4 | 13.853014 |
17832 rows × 6 columns
sns.displot(data = m03_df_lf, x='value', col='variable', kind='kde',
facet_kws={'sharex': False, 'sharey': False},
common_norm=False)
plt.show()
Do the summary statistics of the OPERATING VARIABLES x1 through x4 vary across the Batch values?
sns.catplot(data = m03_df_lf, x='Batch', y='value', col='variable', col_wrap=1, aspect=3,
kind='box',
sharey=False)
plt.show()
Based on the above plot, yes, summary statistics of the OPERATING VARIABLES x1 through x4 vary across the Batch values
Do the relationships between the OPERATING VARIABLES x1 through x4 vary across the Batch values?
sns.pairplot(data=m03_df[['x1', 'x2', 'x3', 'x4', 'Batch']],
hue='Batch',
diag_kws={'common_norm': False},
palette='viridis')
plt.show()
the_groups = m03_df.Batch.unique().tolist()
corr_per_group = m03_df.loc[ :, ['Batch', 'x1', 'x2', 'x3', 'x4']].groupby(['Batch']).corr()
subgroups = [the_groups[i:i + 5] for i in range(0, len(the_groups) - 1, 5)] # Done so that the plot isn't cluttered
subgroups
[[1, 2, 3, 4, 5], [6, 9, 12, 13, 14], [15, 16, 17, 19, 20], [21, 22, 23, 24, 26], [27, 28, 29, 31, 32], [33, 34, 35, 36, 37], [38, 40, 41, 42, 43], [44, 45, 46, 47, 48]]
the_groups = m03_df.Batch.unique().tolist()
corr_per_group = m03_df.loc[ :, ['Batch', 'x1', 'x2', 'x3', 'x4']].groupby(['Batch']).corr()
subgroups = [the_groups[i:i + 5] for i in range(0, len(the_groups), 5)] # Done so that the plot isn't cluttered
for sg in subgroups:
    if len(sg) < 5:
        # pad the final, shorter row to 5 axes so the subplot sizing matches the other rows
        fig, axs = plt.subplots(1, 5, figsize=(18, 5), sharex=True, sharey=True )
    else:
        fig, axs = plt.subplots(1, len(sg), figsize=(18, 5), sharex=True, sharey=True )
    for i in range(len(sg)):
        sns.heatmap( data = corr_per_group.loc[ sg[ i ] ],
                     vmin=-1, vmax=1, center = 0,
                     cmap='coolwarm', cbar=False,
                     annot=True, annot_kws={'size': 10},
                     ax=axs[i] )
        axs[i].set_title('Batch: %s' % sg[ i ] )
    plt.show()
Based on the above pair plot and correlation plot we can confirm that the relationships between the operating variables vary across Batch values
Relationships between machine datasets
m01_df['machine_id'] = 1
m02_df['machine_id'] = 2
m03_df['machine_id'] = 3
machine_df = pd.concat([m01_df, m02_df, m03_df])
machine_df['Batch']=machine_df.Batch.astype('object')
machine_df['s_id']=machine_df.s_id.astype('object')
machine_df['ID']=machine_df.ID.astype('object')
machine_df['machine_id']=machine_df.machine_id.astype('object')
machine_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 14729 entries, 0 to 4457
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   ID          14729 non-null  object
 1   Batch       14729 non-null  object
 2   s_id        14729 non-null  object
 3   x1          14729 non-null  float64
 4   x2          14729 non-null  float64
 5   x3          14729 non-null  float64
 6   x4          14729 non-null  float64
 7   machine_id  14729 non-null  object
dtypes: float64(4), object(4)
memory usage: 1.0+ MB
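A quick sanity check (my own addition, not required by the exam) is that a row-wise concatenation preserves every row from the input frames. A sketch on toy stand-ins (hypothetical values, not the exam data):

```python
import pandas as pd

# toy stand-ins for the three machine DataFrames
a = pd.DataFrame({'x1': [1.0, 2.0]})
b = pd.DataFrame({'x1': [3.0]})
c = pd.DataFrame({'x1': [4.0, 5.0, 6.0]})

# tag each frame before concatenating, as done for machine_id above
for i, df in enumerate([a, b, c], start=1):
    df['machine_id'] = i

combined = pd.concat([a, b, c], ignore_index=True)

# row count of the result must equal the sum of the inputs
assert len(combined) == len(a) + len(b) + len(c)
print(combined.shape)  # (6, 2)
```

Note that `ignore_index=True` resets the row index; concatenating without it keeps each machine's original index, which is why `machine_df.info()` above reports `Index: 14729 entries, 0 to 4457`.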
df_features = machine_df.select_dtypes('number').copy()
df_objects = machine_df.select_dtypes('object').copy()
id_cols = ['rowid'] + df_objects.columns.to_list()
machine_df_lf = machine_df.reset_index().\
rename(columns={'index': 'rowid'}).\
melt(id_vars=id_cols, value_vars=df_features.columns)
machine_df_lf.head()
| | rowid | ID | Batch | s_id | machine_id | variable | value |
|---|---|---|---|---|---|---|---|
| 0 | 0 | B001-M01-S001 | 1 | 1 | 1 | x1 | 50.117118 |
| 1 | 1 | B001-M01-S002 | 1 | 2 | 1 | x1 | 46.887333 |
| 2 | 2 | B001-M01-S003 | 1 | 3 | 1 | x1 | 50.132744 |
| 3 | 3 | B001-M01-S004 | 1 | 4 | 1 | x1 | 48.501042 |
| 4 | 4 | B001-M01-S005 | 1 | 5 | 1 | x1 | 49.690442 |
How many unique values for Batch are associated with each MACHINE (data set)?
fig, ax = plt.subplots()
ax = sns.barplot(machine_df.groupby('machine_id').Batch.nunique())
ax.bar_label(ax.containers[0], fontsize=10);
There are 50 unique values for Batch on machines 1 and 2, and 41 on machine 3.
How many cell phone cases are associated with each Batch value for each MACHINE (data set)?
g = sns.catplot(data=machine_df, x='Batch', kind='count', col='machine_id', col_wrap=1, aspect=3,)
g.fig.subplots_adjust(top=0.9)
# iterate through axes
for ax in g.axes.ravel():
    # add annotations
    for c in ax.containers:
        labels = [f'{(v.get_height()):.0f}' for v in c]
        ax.bar_label(c, labels=labels, label_type='edge')
    ax.margins(y=0.2)
plt.tight_layout()
plt.show()
Do the summary statistics of the OPERATING VARIABLES x1 through x4 vary across the three MACHINES?
sns.catplot(data = machine_df_lf, hue='machine_id', col='variable', y='value',
kind='box',
sharey=False, showmeans=True,
meanprops={"marker":"s","markerfacecolor":"white", "markeredgecolor":"blue"}
)
# plt.subplots_adjust(hspace=0.5)
plt.tight_layout()
plt.show()
sns.catplot(data = machine_df_lf, hue='machine_id', col='variable', y='value',
kind='point',
linestyle='none'
)
# plt.subplots_adjust(hspace=0.5)
plt.tight_layout()
plt.show()
🛎️ The summary statistics of the OPERATING VARIABLES x1 through x4 differ from one another but DO NOT vary across the THREE MACHINES.
Do the relationships between the OPERATING VARIABLES x1 through x4 vary across the three MACHINES?
the_groups_machine = machine_df.machine_id.unique().tolist()
corr_per_group_machine = machine_df.loc[ :, ['x1', 'x2', 'x3', 'x4', 'machine_id']].groupby(['machine_id']).corr()
fig, axs = plt.subplots(1, len(the_groups_machine), figsize=(15, 5), sharex=True, sharey=True)
for i in range(len(the_groups_machine)):
    sns.heatmap(data = corr_per_group_machine.loc[the_groups_machine[i]],
                vmin=-1, vmax=1, center = 0,
                cmap='coolwarm', cbar=False,
                annot=True, annot_kws={'size': 10},
                ax=axs[i])
    axs[i].set_title('machine_id: %s' % the_groups_machine[i] )
plt.show()
A few observations:

- x1 and x2 have a strong positive correlation across all three machines
- x3 and x4 have a moderate negative correlation across all three machines
- the correlation between x1 and x3 varies across the three machines
- the correlation between x1 and x4 varies across the three machines
- the correlation between x2 and x3 varies across the three machines

sns.pairplot(data = machine_df[['x1','x2','x3','x4','machine_id']],
hue='machine_id',
diag_kws={'common_norm': False})
plt.show()
🛎️ Yes, the relationships between the OPERATING VARIABLES x1 through x4 vary across the three MACHINES
The supplier batch data set file name is provided for you below. You must read in the CSV file and assign the data set to the batch_df object.
You must perform the following ESSENTIAL activities:
You must visually explore the MARGINAL behavior of the variables in the data. You must use visualizations appropriate for the DATA TYPE of the columns.
You must visually explore RELATIONSHIPS between variables in the data. You must use visualizations appropriate for the DATA TYPES. You must make sure that your visualizations can answer the following questions:
- Do the summary statistics for Density depend on the Supplier?
- Does the average Density depend on the Supplier?
- How does Density relate to Batch for each Supplier?

After exploring the batch_df DataFrame, you MUST JOIN/MERGE the batch_df DataFrame with the machine_df DataFrame. Assign the merged DataFrame to the dfa DataFrame.
You can now explore the relationships between the MACHINE OPERATIONAL VARIABLES and the SUPPLIERS! You must use visualizations to explore the following relationships:
- Explore if the summary statistics of the 4 OPERATING VARIABLES x1 through x4 vary across Batch for each MACHINE given each Supplier. Your figures MUST use Batch as the x-axis variable.
- Explore if the relationships between the 4 OPERATING VARIABLES x1 through x4 vary across Supplier.

You may add as many markdown and code cells as you see fit to answer this question.
# define the batch supplier file
batch_file = 'midterm_supplier.csv'
# read in the batch supplier data set
batch_df = pd.read_csv(batch_file)
print(f'There are {batch_df.shape[0]} rows and {batch_df.shape[1]} columns')
There are 50 rows and 3 columns
batch_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Batch     50 non-null     int64
 1   Supplier  50 non-null     object
 2   Density   50 non-null     float64
dtypes: float64(1), int64(1), object(1)
memory usage: 1.3+ KB
There are three columns:

- Batch is a number, but per the problem statement it will be converted to a categorical
- Supplier is a categorical
- Density is a number

batch_df['Batch'] = batch_df['Batch'].astype('object')
batch_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Batch     50 non-null     object
 1   Supplier  50 non-null     object
 2   Density   50 non-null     float64
dtypes: float64(1), object(2)
memory usage: 1.3+ KB
batch_df.nunique(dropna=False)
Batch       50
Supplier     2
Density     50
dtype: int64
There are 50 unique values for Batch and Density and 2 unique values for Supplier
batch_df.isna().sum()
Batch       0
Supplier    0
Density     0
dtype: int64
There are no missing values
sns.catplot(data=batch_df, x='Batch', kind='count', aspect=3 )
plt.show()
sns.catplot(data=batch_df, x='Supplier', kind='count' )
plt.show()
sns.displot(data=batch_df, x='Density', kind='kde')
plt.show()
Density and Supplier
sns.displot(data = batch_df, x='Density', hue='Supplier',
common_norm=False, kind='kde')
plt.show()
Density has high positive skew (or right skew) for Supplier A and moderate negative skew (or left skew) for Supplier B
Do the summary statistics for Density depend on the Supplier?
sns.catplot(data = batch_df, x='Supplier', y='Density', kind='box',
showmeans=True,
meanprops={'marker': 'o', 'markerfacecolor': 'white', 'markeredgecolor': 'black'})
plt.show()
🛎️ Yes, the SUMMARY STATISTICS of Density depend on the Supplier, and the average Density depends on the Supplier.
Does the average Density depend on the Supplier?
sns.catplot(data = batch_df, x='Supplier', y='Density', kind='point', linestyle='none')
plt.show()
How does Density relate to Batch for each Supplier?
batch_df['Batch'] = batch_df.Batch.astype('int64') # converting batch to int64 for trend plot
sns.lmplot(data=batch_df, x='Batch', y='Density', col='Supplier')
plt.show()
Observation: Since the TREND LINE is not very steep, we cannot establish a clear relationship between Batch and Density for either unique value of Supplier.
the_groups_Supplier = batch_df.Supplier.unique().tolist()
correlation_per_Supplier = batch_df.loc[ :, ['Batch', 'Density', 'Supplier']].groupby(['Supplier']).corr()
fig, axs = plt.subplots(1, len(the_groups_Supplier), figsize=(15, 5), sharex=True, sharey=True)
for i in range(len(the_groups_Supplier)):
    sns.heatmap(data = correlation_per_Supplier.loc[the_groups_Supplier[i]],
                vmin=-1, vmax=1, center = 0,
                cmap='coolwarm', cbar=False,
                annot=True, annot_kws={'size': 10},
                ax=axs[i])
    axs[i].set_title('Supplier: %s' % the_groups_Supplier[i] )
plt.show()
Observation: Negligible correlation. So, no clear relationship can be established
🛎️ No clear relationship can be established
# merge the batch supplier data set with the (concatenated) machine data set
dfa = pd.merge( batch_df, machine_df, on='Batch' )
dfa
| | Batch | Supplier | Density | ID | s_id | x1 | x2 | x3 | x4 | machine_id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | B | 10.388587 | B001-M01-S001 | 1 | 50.117118 | 102.167346 | 22.067812 | 13.889524 | 1 |
| 1 | 1 | B | 10.388587 | B001-M01-S002 | 2 | 46.887333 | 112.266102 | 23.197330 | 13.064709 | 1 |
| 2 | 1 | B | 10.388587 | B001-M01-S003 | 3 | 50.132744 | 103.674908 | 22.319933 | 13.647482 | 1 |
| 3 | 1 | B | 10.388587 | B001-M01-S004 | 4 | 48.501042 | 107.143156 | 22.162947 | 14.077758 | 1 |
| 4 | 1 | B | 10.388587 | B001-M01-S005 | 5 | 49.690442 | 102.120283 | 22.248696 | 13.728666 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 14724 | 50 | B | 9.935676 | B050-M02-S092 | 92 | 52.690936 | 95.050804 | 20.324674 | 10.963370 | 2 |
| 14725 | 50 | B | 9.935676 | B050-M02-S093 | 93 | 52.348015 | 97.863144 | 20.884779 | 11.156604 | 2 |
| 14726 | 50 | B | 9.935676 | B050-M02-S094 | 94 | 52.187241 | 99.887013 | 20.439142 | 11.303222 | 2 |
| 14727 | 50 | B | 9.935676 | B050-M02-S095 | 95 | 52.875621 | 101.572057 | 20.648029 | 11.084512 | 2 |
| 14728 | 50 | B | 9.935676 | B050-M02-S096 | 96 | 52.491445 | 96.744458 | 20.814762 | 10.886910 | 2 |
14729 rows × 10 columns
dfa.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14729 entries, 0 to 14728
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Batch       14729 non-null  object
 1   Supplier    14729 non-null  object
 2   Density     14729 non-null  float64
 3   ID          14729 non-null  object
 4   s_id        14729 non-null  object
 5   x1          14729 non-null  float64
 6   x2          14729 non-null  float64
 7   x3          14729 non-null  float64
 8   x4          14729 non-null  float64
 9   machine_id  14729 non-null  object
dtypes: float64(5), object(5)
memory usage: 1.1+ MB
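One way to confirm that every machine row found a matching supplier batch in a merge like the one above is `pd.merge` with `indicator=True`. A sketch on toy data (hypothetical values, not the exam frames):

```python
import pandas as pd

# hypothetical miniature versions of machine_df and batch_df
machine_toy = pd.DataFrame({'Batch': [1, 1, 2], 'x1': [50.1, 49.8, 51.2]})
batch_toy = pd.DataFrame({'Batch': [1, 2], 'Supplier': ['A', 'B']})

# indicator=True adds a _merge column recording where each row came from
checked = pd.merge(machine_toy, batch_toy, on='Batch', how='left', indicator=True)

# every machine row should have matched a supplier batch
assert (checked['_merge'] == 'both').all()
print(len(checked))  # 3
```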
# treat Density as categorical so it is kept as an identifier (not melted) below
dfa['Density'] = dfa.Density.astype('object')
dfa_features = dfa.select_dtypes('number').copy()
dfa_objects = dfa.select_dtypes('object').copy()
id_cols = ['rowid'] + dfa_objects.columns.to_list()
dfa_lf = dfa.reset_index().\
rename(columns={'index': 'rowid'}).\
melt(id_vars=id_cols, value_vars=dfa_features.columns)
dfa_lf
| | rowid | Batch | Supplier | Density | ID | s_id | machine_id | variable | value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | B | 10.388587 | B001-M01-S001 | 1 | 1 | x1 | 50.117118 |
| 1 | 1 | 1 | B | 10.388587 | B001-M01-S002 | 2 | 1 | x1 | 46.887333 |
| 2 | 2 | 1 | B | 10.388587 | B001-M01-S003 | 3 | 1 | x1 | 50.132744 |
| 3 | 3 | 1 | B | 10.388587 | B001-M01-S004 | 4 | 1 | x1 | 48.501042 |
| 4 | 4 | 1 | B | 10.388587 | B001-M01-S005 | 5 | 1 | x1 | 49.690442 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 58911 | 14724 | 50 | B | 9.935676 | B050-M02-S092 | 92 | 2 | x4 | 10.963370 |
| 58912 | 14725 | 50 | B | 9.935676 | B050-M02-S093 | 93 | 2 | x4 | 11.156604 |
| 58913 | 14726 | 50 | B | 9.935676 | B050-M02-S094 | 94 | 2 | x4 | 11.303222 |
| 58914 | 14727 | 50 | B | 9.935676 | B050-M02-S095 | 95 | 2 | x4 | 11.084512 |
| 58915 | 14728 | 50 | B | 9.935676 | B050-M02-S096 | 96 | 2 | x4 | 10.886910 |
58916 rows × 9 columns
Explore if the summary statistics of the 4 OPERATING VARIABLES x1 through x4 vary across Batch for each MACHINE given each Supplier. Your figures MUST use Batch as the x-axis variable.
sns.catplot(data = dfa_lf, x='Batch', y='value', col='machine_id',
row='variable', hue='Supplier',
kind='box',
sharey=False)
plt.show()
🛎️ Given a Supplier, the summary statistics of the OPERATING VARIABLES x1 through x4 vary across Batch for each MACHINE.
Explore if the relationships between the 4 OPERATING VARIABLES x1 through x4 vary across Supplier.
sns.pairplot(data=dfa[['Supplier', 'x1', 'x2', 'x3', 'x4']],
hue='Supplier',
diag_kws={'common_norm': False})
plt.show()
Observation: Primarily for Supplier A, higher values of x4 tend to have distinct relationships with the other three variables.
the_groups = dfa.Supplier.unique().tolist()
corr_per_group = dfa.loc[ :, ['Supplier', 'x1', 'x2', 'x3', 'x4']].groupby(['Supplier']).corr()
fig, axs = plt.subplots(1, len(the_groups), figsize=(18, 6), sharex=True, sharey=True )
for ix in range(len(the_groups)):
    sns.heatmap( data = corr_per_group.loc[ the_groups[ ix ] ],
                 vmin=-1, vmax=1, center = 0,
                 cmap='coolwarm', cbar=False,
                 annot=True, annot_kws={'size': 20},
                 ax=axs[ix] )
    axs[ ix ].set_title('Supplier: %s' % the_groups[ ix ] )
plt.show()
Observations:
The following relationships are seen across both Suppliers:
- x1 and x2 are highly correlated
- x3 and x4 are moderately correlated

🛎️ Based on the above plots, yes, the relationships between the 4 OPERATING VARIABLES x1 through x4 vary across Supplier.
The DROP TEST result data set file name is provided for you below. You must read in the CSV file and assign the data set to the test_df object.
You must perform the following ESSENTIAL activities:
You must visually explore the MARGINAL behavior of the variables in the data. You must use visualizations appropriate for the DATA TYPE of the columns.
You must visually explore RELATIONSHIPS between variables in the data. You must use visualizations appropriate for the DATA TYPES. You must make sure that your visualizations can answer the following questions:
- Count the number of times each unique value of Result occurs for each test_group_id value.

After exploring the test_df DataFrame, you MUST JOIN/MERGE the test_df DataFrame with the dfa DataFrame. Assign the merged DataFrame to the dfb object. You MUST answer the following:
You may add as many markdown and code cells as you see fit to answer this question.
# define the test data set file name
test_file = 'midterm_test.csv'
# read in the test data set
test_df = pd.read_csv(test_file)
print(f'There are {test_df.shape[0]} rows and {test_df.shape[1]} columns')
There are 1412 rows and 3 columns
test_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1412 entries, 0 to 1411 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 1412 non-null object 1 test_group_id 1412 non-null object 2 Result 1412 non-null int64 dtypes: int64(1), object(2) memory usage: 33.2+ KB
test_df.nunique()
ID 1412 test_group_id 141 Result 2 dtype: int64
There are 1412 unique values for ID, 141 unique values for test_group_id and 2 unique values for Result
test_df.isna().sum()
ID 0 test_group_id 0 Result 0 dtype: int64
No missing values
test_df.describe()
| Result | |
|---|---|
| count | 1412.000000 |
| mean | 0.301700 |
| std | 0.459158 |
| min | 0.000000 |
| 25% | 0.000000 |
| 50% | 0.000000 |
| 75% | 1.000000 |
| max | 1.000000 |
test_df.describe(include='object')
| ID | test_group_id | |
|---|---|---|
| count | 1412 | 1412 |
| unique | 1412 | 141 |
| top | B001-M01-S056 | L-4 |
| freq | 1 | 14 |
sns.catplot(data=test_df,
x='Result',
kind='count')
plt.show()
sns.catplot(data=test_df,
y='test_group_id',
kind='count',
height=20)
plt.show()
sns.catplot(data=test_df,
x='ID',
kind='count',
aspect=5)
plt.show()
Count the number of times each unique value of Result occurs for each test_group_id value.
sns.catplot(data = test_df, y='test_group_id', hue='Result', height=25, kind='count')
plt.show()
fig, ax = plt.subplots(figsize=(10, 60))
sns.heatmap( pd.crosstab( test_df.test_group_id, test_df.Result ), ax = ax,
annot=True, annot_kws={'size': 15}, fmt='d')
plt.show()
🛎️ The plots above show the count of the number of times each unique value of Result occurs for each test_group_id. Visualizing a crosstab as a heatmap provides a clearer view for scenarios with many categories.
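As a side note, `pd.crosstab` can also report within-group proportions directly via its `normalize` parameter, which makes groups of different sizes easier to compare than raw counts. A minimal sketch on hypothetical toy data (not the actual test_df):

```python
import pandas as pd

# Hypothetical toy data standing in for test_df (test_group_id / Result)
toy = pd.DataFrame({
    'test_group_id': ['A-0', 'A-0', 'A-0', 'L-4', 'L-4'],
    'Result':        [1,     0,     1,     0,     0],
})

# Raw counts per group -- what the heatmap above displays
counts = pd.crosstab(toy.test_group_id, toy.Result)

# normalize='index' converts each row to within-group proportions
props = pd.crosstab(toy.test_group_id, toy.Result, normalize='index')
print(counts)
print(props)
```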
# merge test_df with the dfa object
dfb = pd.merge( test_df, dfa, on='ID' )
dfb
| ID | test_group_id | Result | Batch | Supplier | Density | s_id | x1 | x2 | x3 | x4 | machine_id | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B001-M01-S056 | A-0 | 1 | 1 | B | 10.388587 | 56 | 49.215485 | 105.647327 | 21.410531 | 14.154576 | 1 |
| 1 | B001-M01-S002 | A-0 | 1 | 1 | B | 10.388587 | 2 | 46.887333 | 112.266102 | 23.197330 | 13.064709 | 1 |
| 2 | B001-M01-S049 | A-0 | 1 | 1 | B | 10.388587 | 49 | 49.125629 | 112.323042 | 22.874947 | 13.743788 | 1 |
| 3 | B001-M01-S030 | A-0 | 0 | 1 | B | 10.388587 | 30 | 48.306819 | 109.445722 | 22.910858 | 13.839830 | 1 |
| 4 | B001-M01-S048 | A-0 | 1 | 1 | B | 10.388587 | 48 | 50.564504 | 109.184587 | 22.083064 | 13.954942 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1407 | B049-M03-S026 | K-5 | 1 | 49 | A | 9.111256 | 26 | 49.464765 | 102.967447 | 22.411845 | 13.919089 | 3 |
| 1408 | B049-M03-S008 | K-5 | 1 | 49 | A | 9.111256 | 8 | 49.408191 | 103.516814 | 21.966617 | 13.730119 | 3 |
| 1409 | B049-M03-S041 | K-5 | 1 | 49 | A | 9.111256 | 41 | 49.605196 | 103.460366 | 21.932429 | 13.790280 | 3 |
| 1410 | B049-M03-S061 | K-5 | 1 | 49 | A | 9.111256 | 61 | 49.716703 | 104.346466 | 22.059022 | 13.543388 | 3 |
| 1411 | B049-M03-S051 | K-5 | 1 | 49 | A | 9.111256 | 51 | 49.885679 | 104.690007 | 22.055338 | 13.882302 | 3 |
1412 rows × 12 columns
dfb.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1412 entries, 0 to 1411 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 1412 non-null object 1 test_group_id 1412 non-null object 2 Result 1412 non-null int64 3 Batch 1412 non-null object 4 Supplier 1412 non-null object 5 Density 1412 non-null object 6 s_id 1412 non-null object 7 x1 1412 non-null float64 8 x2 1412 non-null float64 9 x3 1412 non-null float64 10 x4 1412 non-null float64 11 machine_id 1412 non-null object dtypes: float64(4), int64(1), object(7) memory usage: 132.5+ KB
dfb.shape
(1412, 12)
🛎️ After joining, 1412 rows remain.
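When checking that a merge behaved as expected, `pd.merge` offers `validate=` (raises if the key relationship is violated) and `indicator=True` (records each row's provenance). A small sketch on hypothetical toy frames, not the actual test_df/dfa:

```python
import pandas as pd

# Hypothetical miniature versions of test_df and dfa, keyed on ID
left  = pd.DataFrame({'ID': ['a', 'b', 'c'], 'Result': [1, 0, 1]})
right = pd.DataFrame({'ID': ['a', 'b', 'c', 'd'], 'x1': [1.0, 2.0, 3.0, 4.0]})

# validate='one_to_one' raises if ID is not unique on both sides;
# indicator=True adds a _merge column with each row's provenance
merged = pd.merge(left, right, on='ID', how='inner',
                  validate='one_to_one', indicator=True)
print(merged)
```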
You must now examine the merged dfb object and answer the following:
- Count the number of times each unique value of Result occurs for each value of machine_id.
- Count the number of times each unique value of Result occurs for each value of Supplier.
- Visualize the number of times each unique value of Result occurs per Batch for each value of machine_id.
- Visualize the number of times each unique value of Result occurs per Batch for each value of machine_id and Supplier.
- Calculate the PROPORTION of times the cell phone case failed the test in each Batch per machine_id.
- Visualize the PROPORTION of times the cell phone case failed the test in each Batch per machine_id and for each unique value of Supplier.

HINT: Remember that a FAILED test is encoded as Result == 1. How can you calculate the PROPORTION of times Result == 1?
Add as many cells as you see fit to answer this question.
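On the hint: because Result is encoded as 0/1, the mean of the column equals the proportion of 1s (failures). A minimal sketch on hypothetical toy data (not the actual dfb):

```python
import pandas as pd

# Toy Result column: failures are encoded as 1, passes as 0
toy = pd.DataFrame({
    'machine_id': [1, 1, 1, 2, 2],
    'Result':     [1, 0, 1, 0, 0],
})

# The mean of a 0/1 column IS the failure proportion --
# a compact alternative to value_counts(normalize=True)
fail_prop = toy.groupby('machine_id')['Result'].mean()
print(fail_prop)
```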
Count the number of times each unique value of Result occurs for each value of machine_id.
ax = sns.countplot(x='machine_id', hue='Result', data=dfb)
for container in ax.containers:
ax.bar_label(container)
Count the number of times each unique value of Result occurs for each value of Supplier.
ax = sns.countplot(x='Supplier', hue='Result', data=dfb)
for container in ax.containers:
ax.bar_label(container)
Visualize the number of times each unique value of Result occurs per Batch for each value of machine_id.
g = sns.catplot(data=dfb, x='machine_id', hue='Result', col='Batch', kind='count', col_wrap=4, sharex=False, sharey=False)
g.fig.subplots_adjust(top=0.9)
for ax in g.axes.ravel():
    # add annotations
for c in ax.containers:
labels = [f'{(v.get_height()):.0f}' for v in c]
ax.bar_label(c, labels=labels, label_type='edge')
ax.margins(y=0.2)
plt.tight_layout()
plt.show()
# print(dfb.groupby(['machine_id', 'Batch']).Result.value_counts().to_string())
Visualize the number of times each unique value of Result occurs per Batch for each value of machine_id and Supplier
g = sns.catplot(data=dfb, x='machine_id', hue='Result',
col='Supplier', row='Batch',
kind='count',
sharex=False,
sharey=False)
g.fig.subplots_adjust(top=0.9)
for ax in g.axes.ravel():
# add annotations
for c in ax.containers:
labels = [f'{(v.get_height()):.0f}' for v in c]
ax.bar_label(c, labels=labels, label_type='edge')
ax.margins(y=0.2)
plt.tight_layout()
plt.show()
Calculate the PROPORTION of times the cell phone case failed the test in each Batch per machine_id
failure_proportion = (dfb.groupby(['Batch', 'machine_id'])['Result']\
.value_counts(normalize=True)\
.unstack(fill_value=0)[1]\
.reset_index(name='failure_proportion')
)
failure_proportion
| Batch | machine_id | failure_proportion | |
|---|---|---|---|
| 0 | 1 | 1 | 0.857143 |
| 1 | 1 | 2 | 1.000000 |
| 2 | 1 | 3 | 1.000000 |
| 3 | 2 | 1 | 0.230769 |
| 4 | 2 | 2 | 0.181818 |
| ... | ... | ... | ... |
| 136 | 49 | 1 | 1.000000 |
| 137 | 49 | 2 | 1.000000 |
| 138 | 49 | 3 | 1.000000 |
| 139 | 50 | 1 | 0.100000 |
| 140 | 50 | 2 | 0.111111 |
141 rows × 3 columns
dfb[(dfb['Batch']==1) & (dfb['machine_id']==1)]
| ID | test_group_id | Result | Batch | Supplier | Density | s_id | x1 | x2 | x3 | x4 | machine_id | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B001-M01-S056 | A-0 | 1 | 1 | B | 10.388587 | 56 | 49.215485 | 105.647327 | 21.410531 | 14.154576 | 1 |
| 1 | B001-M01-S002 | A-0 | 1 | 1 | B | 10.388587 | 2 | 46.887333 | 112.266102 | 23.197330 | 13.064709 | 1 |
| 2 | B001-M01-S049 | A-0 | 1 | 1 | B | 10.388587 | 49 | 49.125629 | 112.323042 | 22.874947 | 13.743788 | 1 |
| 3 | B001-M01-S030 | A-0 | 0 | 1 | B | 10.388587 | 30 | 48.306819 | 109.445722 | 22.910858 | 13.839830 | 1 |
| 4 | B001-M01-S048 | A-0 | 1 | 1 | B | 10.388587 | 48 | 50.564504 | 109.184587 | 22.083064 | 13.954942 | 1 |
| 5 | B001-M01-S038 | A-0 | 1 | 1 | B | 10.388587 | 38 | 52.058573 | 93.272568 | 21.937216 | 13.332882 | 1 |
| 6 | B001-M01-S024 | A-0 | 1 | 1 | B | 10.388587 | 24 | 51.531574 | 100.207219 | 22.281345 | 13.796810 | 1 |
Visualize the PROPORTION of times the cell phone case failed the test in each Batch per machine_id and for each unique value of Supplier
failure_proportion_supplier = (dfb.groupby(['Batch', 'machine_id', 'Supplier'])['Result']\
.value_counts(normalize=True)\
.unstack(fill_value=0)[1]\
.reset_index(name='failure_proportion_supplier')
)
failure_proportion_supplier.sort_values(['Batch', 'machine_id', 'Supplier'], inplace=True)
g = sns.catplot(data=failure_proportion_supplier,
x='Batch', y='failure_proportion_supplier',
hue='Supplier', row='machine_id',
kind='bar',
aspect=3,
sharex=False,
sharey=False,)
You must cluster the rows of dfb using the 4 operational variables x1 through x4. You must decide how many clusters to use and describe how you made that choice. You may use KMeans OR Hierarchical clustering. Include any figures that helped you make that choice.
Visualize your cluster analysis results by:
You are interested in the PROPORTION of cell phone cases that failed the DROP TEST. Are any of the clusters associated with higher failure PROPORTIONS than others? Based on your visualizations how would you describe that cluster?
Add as many cells as you see fit to answer this question.
columns_to_use = ['x1', 'x2', 'x3', 'x4', 'Result']
df = dfb[columns_to_use].copy()
df
| x1 | x2 | x3 | x4 | Result | |
|---|---|---|---|---|---|
| 0 | 49.215485 | 105.647327 | 21.410531 | 14.154576 | 1 |
| 1 | 46.887333 | 112.266102 | 23.197330 | 13.064709 | 1 |
| 2 | 49.125629 | 112.323042 | 22.874947 | 13.743788 | 1 |
| 3 | 48.306819 | 109.445722 | 22.910858 | 13.839830 | 0 |
| 4 | 50.564504 | 109.184587 | 22.083064 | 13.954942 | 1 |
| ... | ... | ... | ... | ... | ... |
| 1407 | 49.464765 | 102.967447 | 22.411845 | 13.919089 | 1 |
| 1408 | 49.408191 | 103.516814 | 21.966617 | 13.730119 | 1 |
| 1409 | 49.605196 | 103.460366 | 21.932429 | 13.790280 | 1 |
| 1410 | 49.716703 | 104.346466 | 22.059022 | 13.543388 | 1 |
| 1411 | 49.885679 | 104.690007 | 22.055338 | 13.882302 | 1 |
1412 rows × 5 columns
df['Result'] = df.Result.astype('object')
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1412 entries, 0 to 1411 Data columns (total 5 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 x1 1412 non-null float64 1 x2 1412 non-null float64 2 x3 1412 non-null float64 3 x4 1412 non-null float64 4 Result 1412 non-null object dtypes: float64(4), object(1) memory usage: 55.3+ KB
df.isna().sum()
x1 0 x2 0 x3 0 x4 0 Result 0 dtype: int64
🛎️ There are no MISSING VALUES in the dataset
df_features = df.select_dtypes('number').copy()
sns.catplot(data = df_features, kind='box', aspect=2)
plt.show()
💡 Since one variable dominates in magnitude, the data must be standardized first to remove the MAGNITUDE and SCALE effect. KMeans defines SIMILARITY via DISTANCE, and distance depends on MAGNITUDE and SCALE.
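To illustrate the effect of standardization, here is a minimal sketch on a hypothetical toy matrix (not the actual features): after `StandardScaler`, every column has mean 0 and unit variance, so no single variable dominates the Euclidean distances KMeans uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix with very different column scales (like x2 vs x4)
X_raw = np.array([[50., 100., 14.],
                  [52., 110., 13.],
                  [48.,  90., 15.]])

# Each column is shifted to mean 0 and scaled to unit (population) variance
X_std = StandardScaler().fit_transform(X_raw)
print(X_std.mean(axis=0))
print(X_std.std(axis=0))
```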
# Using sklearn StandardScaler to standardize the dataset
X = StandardScaler().fit_transform(df_features)
sns.catplot(data = pd.DataFrame(X, columns=df_features.columns), kind='box', aspect=2)
plt.show()
pca = PCA(n_components=2).fit_transform(X)
pca.shape
(1412, 2)
pca_df = pd.DataFrame( pca, columns=['pc01', 'pc02'] )
sns.relplot(data = pca_df, x='pc01', y='pc02')
plt.show()
sns.pairplot(data = pca_df)
plt.show()
fig, ax = plt.subplots()
sns.heatmap( pca_df.corr(numeric_only=True),
vmin=-1, vmax=1, center=0,
cmap='coolwarm',
annot=True, annot_kws={'fontsize': 20}, fmt='.3f',
ax=ax)
plt.show()
pca_df['Result'] = df.Result
Visualizing PCA results with known grouping
sns.lmplot(data = pca_df, x='pc01', y='pc02', hue='Result')
plt.show()
Since the above plot does not clearly separate a cluster with a higher proportion of failures (Result = 1), we need to continue with other clustering techniques such as KMeans.
clusters_2 = KMeans(n_clusters=2, random_state=121, n_init=25, max_iter=500).fit_predict(X)
df['k2'] = pd.Series( clusters_2, index=df.index ).astype('object')
df.k2.value_counts()
k2 1 746 0 666 Name: count, dtype: int64
sns.pairplot(data = df[['x1','x2','x3','x4','k2']], hue='k2', diag_kws={'common_norm': False}, palette='viridis')
plt.show()
fig, ax = plt.subplots(figsize=(20,5))
sns.heatmap(data = pd.crosstab(df.Result,
df.k2,
margins=True ),
annot=True,
annot_kws={"fontsize": 20},
fmt='g',
cbar=False,
ax=ax)
plt.show()
sns.relplot(data = df, x='x1', y='x2', hue='k2', style='Result')
plt.show()
sns.relplot(data = df, x='x1', y='x3', hue='k2', style='Result')
plt.show()
sns.relplot(data = df, x='x1', y='x4', hue='k2', style='Result')
plt.show()
sns.relplot(data = df, x='x2', y='x3', hue='k2', style='Result')
plt.show()
sns.relplot(data = df, x='x2', y='x4', hue='k2', style='Result')
plt.show()
sns.relplot(data = df, x='x3', y='x4', hue='k2', style='Result')
plt.show()
🛎️ Based on the above plots we cannot characterize the failures (Result = 1) using just 2 clusters. We need to continue the analysis with more clusters.
tots_within = []
K = range(1, 31)
for k in K:
km = KMeans(n_clusters=k, random_state=121, n_init=25, max_iter=500)
km = km.fit(X)
tots_within.append( km.inertia_ )
fig, ax = plt.subplots()
ax.plot( K, tots_within, 'bo-' )
ax.set_xlabel('number of clusters')
ax.set_ylabel('total within sum of squares')
plt.show()
Choosing 5 clusters based on the elbow (knee bend) in the plot above.
clusters_5 = KMeans(n_clusters=5, random_state=121, n_init=25, max_iter=500).fit_predict(X)
df['k5'] = pd.Series( clusters_5, index=df.index ).astype('object')
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1412 entries, 0 to 1411 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 x1 1412 non-null float64 1 x2 1412 non-null float64 2 x3 1412 non-null float64 3 x4 1412 non-null float64 4 Result 1412 non-null object 5 k2 1412 non-null object 6 k5 1412 non-null object dtypes: float64(4), object(3) memory usage: 77.3+ KB
df.k5.value_counts()
k5 2 535 0 371 3 201 4 174 1 131 Name: count, dtype: int64
sns.pairplot(data = df[['x1','x2','x3','x4','k5']], hue='k5', diag_kws={'common_norm': False}, palette='viridis')
plt.show()
fig, ax = plt.subplots(figsize=(20,5))
crosstab = pd.crosstab(df.Result,
df.k5,
margins=True )
sns.heatmap(data = crosstab,
annot=True,
annot_kws={"fontsize": 20},
fmt='g',
cbar=False,
ax=ax)
plt.show()
🛎️ The above heatmap shows that cluster 3 has a higher failure proportion than the other clusters. We can confirm this with the following piece of code.
failed_tests_k5 = (df.groupby(['k5'])['Result']\
.value_counts(normalize=True)\
.unstack(fill_value=0)[1]\
.reset_index(name='failure_proportion')
)
failed_tests_k5.sort_values(['k5'], inplace=True)
failed_tests_k5
| k5 | failure_proportion | |
|---|---|---|
| 0 | 0 | 0.185984 |
| 1 | 1 | 0.061069 |
| 2 | 2 | 0.158879 |
| 3 | 3 | 0.965174 |
| 4 | 4 | 0.402299 |
We now check whether clear relationships between the operating variables can be established to describe this cluster.
sns.relplot(data = df, x='x1', y='x2', hue='k5', style='Result', palette='viridis')
plt.show()
Cell phone cases in cluster 3 have x1 values <= 52 and x2 value >= 100
sns.relplot(data = df, x='x1', y='x3', hue='k5', style='Result', palette='viridis')
plt.show()
Cell phone cases in cluster 3 have x1 values <= 52 and x3 value between 20 and 24 but we can also see this range of x3 is shared by other clusters as well
sns.relplot(data = df, x='x1', y='x4', hue='k5', style='Result', palette='viridis')
plt.show()
Cell phone cases in cluster 3 have x1 values < 50 and x4 values >= 13. A very clear separation is present here.
sns.relplot(data = df, x='x2', y='x3', hue='k5', style='Result', palette='viridis')
plt.show()
Cell phone cases in cluster 3 have x2 between 90 and 110 AND x3 between 20 and 24 but this range is also present in other clusters
sns.relplot(data = df, x='x2', y='x4', hue='k5', style='Result', palette='viridis')
plt.show()
We can see a very clear definition of cluster 3: higher values of x2 and x4.
sns.relplot(data = df, x='x3', y='x4', hue='k5', style='Result', palette='viridis')
plt.show()
We can see a very clear definition of cluster 3: lower values of x3 but higher values of x4.
Although the KNEE BEND plot above is quite clear, I will try an additional clustering technique to see if it yields further insights.
hclust_ward = hierarchy.ward(X)
fig = plt.figure(figsize=(12, 6))
dn = hierarchy.dendrogram(hclust_ward, no_labels=True )
plt.show()
The above dendrogram shows three clusters (colored orange, green, and red); they are distinct because they only merge at a very large height.
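Besides cutting at a height, `hierarchy.cut_tree` can also be driven directly by the desired number of clusters via `n_clusters=`. A minimal sketch on hypothetical toy data with three well-separated groups (not the actual features):

```python
import numpy as np
from scipy.cluster import hierarchy

# Toy 1-D data: three well-separated groups of 10 points each
rng = np.random.default_rng(0)
pts = np.concatenate([rng.normal(0, 0.1, 10),
                      rng.normal(5, 0.1, 10),
                      rng.normal(10, 0.1, 10)]).reshape(-1, 1)

# Ward linkage, then cut the tree at exactly 3 clusters
Z = hierarchy.ward(pts)
labels = hierarchy.cut_tree(Z, n_clusters=3).ravel()
print(np.unique(labels))
```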
np.unique( hierarchy.cut_tree( hclust_ward, height=35).ravel() )
array([0, 1, 2])
np.unique( hierarchy.cut_tree( hclust_ward, height=40).ravel() )
array([0, 1, 2])
np.unique( hierarchy.cut_tree( hclust_ward, height=55).ravel() )
array([0, 1, 2])
pca_df['hclust_3'] = pd.Series( hierarchy.cut_tree( hclust_ward, height=40).ravel(),
index=pca_df.index ).astype('object')
pca_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1412 entries, 0 to 1411 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 pc01 1412 non-null float64 1 pc02 1412 non-null float64 2 Result 1412 non-null object 3 hclust_3 1412 non-null object dtypes: float64(2), object(2) memory usage: 44.2+ KB
sns.catplot(data = pca_df, x='hclust_3', kind='count')
plt.show()
sns.relplot(data = pca_df, x='pc01', y='pc02', hue='hclust_3')
plt.show()
sns.pairplot(data=pca_df[['pc01','pc02','hclust_3']], hue='hclust_3',
diag_kws={'common_norm': False},
palette='viridis')
plt.show()
COMPARE the identified clusters to the KNOWN categories below
fig, ax = plt.subplots()
sns.heatmap(data = pd.crosstab( pca_df.Result, pca_df.hclust_3, margins=True ),
annot=True, annot_kws={'fontsize': 20}, fmt='g',
cbar=False,
ax=ax)
plt.show()
failed_tests_hclust_3 = (pca_df.groupby(['hclust_3'])['Result']\
.value_counts(normalize=True)\
.unstack(fill_value=0)[1]\
.reset_index(name='failure_proportion')
)
failed_tests_hclust_3.sort_values(['hclust_3'], inplace=True)
failed_tests_hclust_3
| hclust_3 | failure_proportion | |
|---|---|---|
| 0 | 0 | 0.459790 |
| 1 | 1 | 0.402299 |
| 2 | 2 | 0.139640 |
We cannot identify a cluster with a clearly higher proportion of failures, so we try increasing the number of clusters.
Increasing the number of clusters to 5
pca_df['hclust_5'] = pd.Series( hierarchy.cut_tree( hclust_ward, height=25).ravel(),
index=pca_df.index ).astype('object')
fig, ax = plt.subplots()
sns.heatmap(data = pd.crosstab( pca_df.Result, pca_df.hclust_5, margins=True ),
annot=True, annot_kws={'fontsize': 20}, fmt='g',
cbar=False,
ax=ax)
plt.show()
failed_tests_hclust_5 = (pca_df.groupby(['hclust_5'])['Result']\
.value_counts(normalize=True)\
.unstack(fill_value=0)[1]\
.reset_index(name='failure_proportion')
)
failed_tests_hclust_5.sort_values(['hclust_5'], inplace=True)
failed_tests_hclust_5
| hclust_5 | failure_proportion | |
|---|---|---|
| 0 | 0 | 0.965174 |
| 1 | 1 | 0.185984 |
| 2 | 2 | 0.402299 |
| 3 | 3 | 0.160448 |
| 4 | 4 | 0.053846 |
🛎️ We can see that cluster 0 has the highest proportion of failures
df_copy = df.copy()
df_copy['hclust_5'] = pca_df.hclust_5.astype('object')
df_copy.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1412 entries, 0 to 1411 Data columns (total 8 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 x1 1412 non-null float64 1 x2 1412 non-null float64 2 x3 1412 non-null float64 3 x4 1412 non-null float64 4 Result 1412 non-null object 5 k2 1412 non-null object 6 k5 1412 non-null object 7 hclust_5 1412 non-null object dtypes: float64(4), object(4) memory usage: 88.4+ KB
# Reshape to long format to enable exploring ALL of the original variables GROUPED BY the clusters!
df_copy_lf = df_copy.reset_index().\
    rename(columns={'index': 'rowid'}).\
    melt(id_vars=['rowid', 'Result', 'hclust_5'],
         value_vars=['x1', 'x2', 'x3', 'x4'])
df_copy_lf
| rowid | Result | hclust_5 | variable | value | |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | x1 | 49.215485 |
| 1 | 1 | 1 | 0 | x1 | 46.887333 |
| 2 | 2 | 1 | 0 | x1 | 49.125629 |
| 3 | 3 | 0 | 0 | x1 | 48.306819 |
| 4 | 4 | 1 | 0 | x1 | 50.564504 |
| ... | ... | ... | ... | ... | ... |
| 5643 | 1407 | 1 | 0 | x4 | 13.919089 |
| 5644 | 1408 | 1 | 0 | x4 | 13.730119 |
| 5645 | 1409 | 1 | 0 | x4 | 13.790280 |
| 5646 | 1410 | 1 | 0 | x4 | 13.543388 |
| 5647 | 1411 | 1 | 0 | x4 | 13.882302 |
5648 rows × 5 columns
sns.catplot(data = df_copy_lf, x='hclust_5', y='value', col='variable', col_wrap=4,
kind='box',
sharey=False)
plt.show()
🛎️ We can see that the summary statistics of each variable vary across the clusters
sns.pairplot(data = df_copy[['x1','x2','x3','x4','hclust_5']], hue='hclust_5', diag_kws={'common_norm': False}, palette='viridis')
plt.show()
🛎️ The above plot reveals that data points in cluster 0 exhibit distinct relationships, and many of these relationships are similar to what was seen above in the KMeans clustering.
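The box plots and pair plot above can be complemented by a compact numeric summary: per-cluster means quantify statements like "cluster 0 has higher x2 and x4". A minimal sketch on a hypothetical mini-frame (not the actual df_copy):

```python
import pandas as pd

# Hypothetical mini-frame mimicking df_copy: operating variables
# plus a cluster label column
toy = pd.DataFrame({
    'x2':       [105., 108., 95., 93.],
    'x4':       [14.0, 14.2, 12.9, 13.0],
    'hclust_5': [0, 0, 1, 1],
})

# Per-cluster means: a numeric complement to the visual comparison
cluster_means = toy.groupby('hclust_5')[['x2', 'x4']].mean()
print(cluster_means)
```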